26. Wrangling vs. EDA vs. ETL
Wrangling vs. EDA
Later in the Data Analyst Nanodegree, or later in your data journey if you're taking this course on its own, you will encounter a topic called exploratory data analysis (EDA). If you've already encountered EDA, you might be thinking that data wrangling seems very similar to it. That's because the two are very similar and often get confused.
Here is one definition of EDA: an analysis approach that focuses on identifying general patterns in the data, and identifying outliers and features of the data that might not have been anticipated.
So where does data wrangling end and EDA start?
Data wrangling is about gathering the right pieces of data, assessing your data's quality and structure, then modifying your data to make it clean. The assessments you make and convert into cleaning operations won't make your analysis, visualization, or model better, though. The goal is just to make them possible, i.e., functional.
EDA is about exploring your data to later augment it and maximize the potential of your analyses, visualizations, and models. When exploring, simple visualizations are often used to summarize your data's main characteristics. From there you can do things like create new and more descriptive features from existing data, also known as feature engineering, or detect and remove outliers so your model's fit is better.
In practice, wrangling and EDA can and often do occur together, but we're going to separate them for teaching purposes.
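To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the records, the field names, and the helper functions wrangle and explore are hypothetical, and the 1.5 × IQR rule is just one common way to flag outliers, not the only one. The wrangling step only makes the data usable (no duplicates, no missing values, numeric types). The EDA step then summarizes the clean data, flags an unexpected value, and engineers a new feature.

```python
import statistics

# Hypothetical raw records; names and values are invented for this example.
raw_rows = [
    {"name": "Ann", "height_cm": "170", "weight_kg": "65"},
    {"name": "Bob", "height_cm": "180", "weight_kg": "81"},
    {"name": "Bob", "height_cm": "180", "weight_kg": "81"},   # exact duplicate
    {"name": "Cy",  "height_cm": "",    "weight_kg": "70"},   # missing height
    {"name": "Dee", "height_cm": "165", "weight_kg": "58"},
    {"name": "Eli", "height_cm": "175", "weight_kg": "400"},  # suspicious weight
    {"name": "Fay", "height_cm": "160", "weight_kg": "62"},
    {"name": "Gus", "height_cm": "185", "weight_kg": "74"},
    {"name": "Hal", "height_cm": "172", "weight_kg": "70"},
]


def wrangle(rows):
    """Wrangling: make the data functional (deduplicate, drop gaps, fix types)."""
    seen = set()
    clean = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        if not row["height_cm"] or not row["weight_kg"]:
            continue  # skip rows with missing measurements
        clean.append({
            "name": row["name"],
            "height_cm": float(row["height_cm"]),  # strings -> numbers
            "weight_kg": float(row["weight_kg"]),
        })
    return clean


def explore(rows):
    """EDA: flag outliers with the 1.5 * IQR rule, then add a BMI feature."""
    weights = [r["weight_kg"] for r in rows]
    q1, _, q3 = statistics.quantiles(weights, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [r for r in rows if not low <= r["weight_kg"] <= high]
    kept = [r for r in rows if low <= r["weight_kg"] <= high]

    # Feature engineering: derive a more descriptive feature from existing ones.
    enriched = [
        {**r, "bmi": r["weight_kg"] / (r["height_cm"] / 100) ** 2}
        for r in kept
    ]
    return enriched, outliers


clean = wrangle(raw_rows)
explored, outliers = explore(clean)

print(len(raw_rows), "raw rows ->", len(clean), "clean rows")
print("flagged as outliers:", [r["name"] for r in outliers])
```

Note that the wrangling step leaves Eli's 400 kg weight in place: the value is well formed, so the data is already functional. Deciding that it is implausible and removing it is a judgment made while exploring, which is why it lives in explore here.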
ETL
You may also have heard of the extract-transform-load process, also known as ETL. ETL differs from data wrangling in three main ways:
- The users are different
- The data is different
- The use cases are different
This article (Data Wrangling Versus ETL: What’s the Difference?) by Wei Zhang explains these three differences well.